This notebook presents an analysis of Charlotte-Mecklenburg traffic patterns for November-December 2018, based on a sample dataset of over 3,000 accidents.
The project goal is to use ensemble learning to build a model that highlights likely accident areas, supporting better public-service response.
Disclaimer: “Accidents” in these results include both traffic control/function (TF) issue events and actual collision events.
Data information: All data was obtained from open sources, including the Charlotte-Mecklenburg open data portal and NCDOT GIS data. Datasets are as recent as available, including 2013-2017 data for traffic volumes, census information, etc.
library(tidyverse)
library(jsonlite)
library(rgeos)
library(plotly)
library(lubridate)
library(sp)
library(geosphere)
library(reshape2)
library(randomForest)
accidents <- fromJSON("accidents.json", flatten = TRUE)
# TODO: add holdout validation set
#holdout_set <- fromJSON("holdout.json", flatten = TRUE)
census_income <- read_csv("census_household_income.csv")
census_pop <- read_csv("census_population_groups.csv")
traffic_signals <- read_csv("traffic_signals.csv")
traffic_volumes <- read_csv("traffic_volumes.csv")
state_roads <- read_csv("state_maintained_roads.csv")
A few steps we need to take before the initial analysis:
Clean/remove unneeded features
Plot relationships of coordinates (traffic signals, census group coordinates)
Combine all the datasets into a single source
Since the census group coordinates for household income and population are nearly identical, we can treat them as the same coordinates for either dataset and combine on them.
census_income <- census_income %>%
rename(
med_household_income = Median_Household_Income,
fam_poverty_rate = FamilyPovertyRate
) %>%
select(
coordinates,
med_household_income,
fam_poverty_rate
)
census_pop <- census_pop %>%
rename(
population = Population,
pop_sq_mile = PopSqMi,
num_adolescents = Age0_17,
num_young_adults = Age18_34,
num_adults = Age35_64,
num_seniors = Age65andOver,
median_age = MedianAge
) %>%
select(
coordinates,
population,
pop_sq_mile,
num_adolescents,
num_young_adults,
num_adults,
num_seniors,
median_age
)
Combine the census datasets on their shared key (coordinates), since both refer to the same areas, carrying over their corresponding information
The join also leaves some corrupted data, so clean up by replacing zero (and missing) values with the mean of the affected column
census_information <-
left_join(census_income, census_pop, by = "coordinates")
census_information <- census_information %>%
mutate(med_household_income = ifelse(is.na(med_household_income), mean(med_household_income, na.rm=TRUE), med_household_income)) %>%
mutate(med_household_income = ifelse(med_household_income == 0, mean(med_household_income, na.rm=TRUE), med_household_income)) %>%
mutate(fam_poverty_rate = ifelse(ceiling(fam_poverty_rate) == 0, mean(fam_poverty_rate, na.rm=TRUE), fam_poverty_rate)) %>%
mutate(median_age = ifelse(ceiling(median_age) == 0, mean(median_age, na.rm=TRUE), median_age))
# Add ID column for tagging to poly plots as foreign key
census_information["new.census_key"] <- as.numeric(rownames(census_information))
Change data types to double for coordinate values
accidents$latitude <- as.double(accidents$latitude)
accidents$longitude <- as.double(accidents$longitude)
Combine census_information with each accident record in question
Join based on the census tract group the accident occurred in: if the accident coordinates fall within a group's coordinate polygon, tag the accident with that census group
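The point-in-polygon tagging idea can be sketched with a toy square polygon and two test points (the names and coordinates here are illustrative, not from the datasets):

```r
library(sp)

# unit square polygon; Polygon() takes an (x, y) coordinate matrix
square <- SpatialPolygons(list(Polygons(list(
  Polygon(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)))), "sq")))

# one point inside the square, one outside
pts <- SpatialPoints(cbind(x = c(0.5, 2.0), y = c(0.5, 2.0)))

# over() returns the polygon index for points that fall inside, NA otherwise
tag <- over(pts, square)
tag  # 1 for the inside point, NA for the outside point
```

The loop below does the same thing, one census polygon at a time, recording the polygon number as a foreign key on each contained accident.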
final_polys <- c()
for(row in seq(from=1, to=nrow(census_information))) {
# turn each row of coordinates into a list of doubles
coord_list <- as.list(strsplit(census_information$coordinates[row], ",")[[1]])
coord_num_list <- as.double(as.character(unlist(coord_list)))
# split latitudes/longitudes into separate vectors, taking every other item
coord_lats <- coord_num_list[seq(from=1, to=length(coord_num_list), by=2)]
coord_lngs <- coord_num_list[seq(from=2, to=length(coord_num_list), by=2)]
# create polygon; Polygon() treats the columns as (x, y), so this order must match the lng/lat order of the accident points tagged below
poly1 <- Polygons(list(Polygon(cbind(coord_lats, coord_lngs))), as.character(row))
# add as a SpatialPolygons object
spatial_poly <- SpatialPolygons(list(poly1))
final_polys <- c(final_polys, spatial_poly)
}
accidents["new.census_key"] <- NA
# build the accident points once, then tag them against each census polygon
accident_coords <- data.frame(Longitude = accidents$longitude, Latitude = accidents$latitude,
names = accidents$address)
coordinates(accident_coords) <- ~ Longitude + Latitude
for(i in seq_along(final_polys)) {
# over() returns the polygon index for points that fall inside it, NA otherwise
proj4string(accident_coords) <- proj4string(final_polys[[i]])
tag <- over(accident_coords, final_polys[[i]])
tagged_indexes <- as.numeric(names(tag[!is.na(tag)]))
accidents$new.census_key[tagged_indexes] <- i
}
Join the census information onto accidents via the tagged foreign key
accidents <- merge(accidents, census_information, by = "new.census_key")
accidents$coordinates <- NULL # drop un-needed column from join with census_block
accidents <- accidents[!duplicated(as.list(accidents))] # remove duplicate columns
Time based data:
Reshape, adding month/day/hour as separate features to allow more drill-down
accidents$datetime_add <- as.POSIXct(accidents$datetime_add, format="%Y-%m-%dT%H:%M:%OS")
accidents["new.month"] <- month(accidents$datetime_add)
accidents["new.day"] <- day(accidents$datetime_add)
accidents["new.hour"] <- hour(accidents$datetime_add)
accidents["new.minute"] <- format(accidents$datetime_add, "%H:%M")
Road names:
Reshape and factorize distinct roads from the addresses; substitute and trim characters to find the factor levels, and replace invalid/corrupt road names ("&") with NA
Road intersection names: grab the actual address of the accident from the police response record
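As a quick illustration of the substitutions used below, applied to a hypothetical intersection address:

```r
addr <- "N TRYON ST & E 36TH ST"

# letters only, collapsed (used for the road factor)
gsub('[^[:alpha:]]+', '', addr)            # "NTRYONSTETHST"

# drop punctuation such as "&" but keep spaces and digits
gsub('[^[:alnum:][:space:]]+', '', addr)   # "N TRYON ST  E 36TH ST"

# letters and spaces only (used later to split out road tokens)
gsub('[^[:alpha:][:space:]]+', '', addr)   # "N TRYON ST  E TH ST"
```

Note that stripping digits mangles tokens like "36TH"; the road-token matching below only needs a distinctive name fragment, so this is acceptable here.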
# Create variations of road names in order to provide insight for tagging volumes and other attributes
accidents["new.road"] <- factor(gsub('[^[:alpha:]]+', '', accidents$address))
accidents$new.road[accidents$new.road == ""] <- NA
# keep the address vector full-length and blank out bare "&" addresses so derived columns align with accidents
accidents.addresses <- accidents$address
accidents.addresses[accidents.addresses == "&"] <- NA
accidents["new.road_alpha"] <- factor(gsub('[^[:alnum:][:space:]]+', '', accidents.addresses))
accidents[grep("^\\s*$", accidents$new.road_alpha), "address"] <- NA
accidents[grep("^\\s*$", accidents$new.road_alpha), "new.road_alpha"] <- NA
accidents["new.road_text"] <- gsub('[^[:alpha:][:space:]]+', '', accidents.addresses)
# Grab the longest road token from the address (the one most likely to match a known traffic-volume route for the general area)
accidents["new.road_short"] <- NA
first_roads <- sapply(strsplit(accidents$new.road_text, " "), function(x) x[which.max(nchar(x))])
first_roads <- unlist(lapply(first_roads, function(x) ifelse(length(x) > 0, x, NA)))
accidents["new.road_short"] <- as.factor(first_roads)
Traffic Signal proximity
# Get the min distance between comparing each accident to the known traffic signals in Mecklenburg area, the min will be the closest traffic signal to the accident in question
final_dists <- c()
for(i in seq(nrow(accidents))) {
dists <- distm(x = c(accidents$longitude[i], accidents$latitude[i]),
y = traffic_signals[, c("X", "Y")],
fun = distHaversine)
min_dist <- min(dists)
final_dists <- c(final_dists, min_dist)
}
accidents["new.signal_proximity"] <- final_dists
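As a hedged alternative to the loop above: distm() also accepts whole coordinate matrices, so the nearest-signal distance can be computed with one call plus a row-wise min. Shown here on toy coordinates (assuming, as above, that traffic_signals stores longitude/latitude in X/Y):

```r
library(geosphere)

# toy accident points and signal locations as (lng, lat) matrices
acc_pts <- cbind(lng = c(-80.84, -80.80), lat = c(35.23, 35.20))
signals <- cbind(X = c(-80.84, -80.90), Y = c(35.23, 35.30))

# full distance matrix in meters: rows = accidents, cols = signals
dist_mat <- distm(acc_pts, signals, fun = distHaversine)

# nearest-signal distance per accident
nearest <- apply(dist_mat, 1, min)
nearest[1]  # 0: the first toy point coincides with the first signal
```

On the real data this would be `apply(distm(as.matrix(accidents[, c("longitude", "latitude")]), as.matrix(traffic_signals[, c("X", "Y")]), fun = distHaversine), 1, min)`, avoiding the per-row loop.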
Street speed limits: assign each road a specific speed limit as a key.
There is speed limit data for certain roads (primary roads, highways, etc.), but not all roads have it. To work around this, we will insert generic speed limits based on common road-name patterns.
If the road is a highway, we check whether it is an interstate and assign accordingly. If it is a road ("RD") we assign 45, and everything else gets a generic 35.
This should account for most speed limits, and a slight difference should not have a great impact on our dataset.
An alternative option, if interested: match each accident's coordinates against the Google Maps Roads API to get the actual speed limit of the road in question; this requires creating a Google Cloud Platform account and the associated setup.
# Assign a generic speed limit to roads based on the unique road names extracted
# HY gets the state speed limit of 55, interstates (I77/I85/I485) 70, RD 45, all others 35
extract_speed <- function(road) {
road <- as.character(road)
if(length(grep("HY", road))) {
# interstates get 70, other highways the state limit of 55
if(length(grep("I77|I85|I485", road))) {
return(70)
}
return(55)
} else if(length(grep("RD", road))) {
return(45)
}
return(35)
}
speeds <- sapply(accidents$new.road, extract_speed)
accidents["new.speed_limit"] <- as.factor(speeds)
Factors for roads: Our current road feature needs some downsizing. We currently have almost 2000 unique road-name factors, and a few options:
We will try approaches for #1 and #3
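One common downsizing approach (a sketch using forcats, which ships with the already-loaded tidyverse; the road names below are made up) is to lump rare levels into an "Other" bucket:

```r
library(forcats)

roads <- factor(c("NTRYONST", "NTRYONST", "NTRYONST", "SBLVD", "SBLVD", "EASTWAYDR"))

# keep the 2 most frequent roads, lump the rest into "Other"
lumped <- fct_lump(roads, n = 2)
table(lumped)
```

In practice `n` would be tuned so the kept levels cover most accidents while the long tail of rarely-seen roads collapses into a single level.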
# Tag a movement score to the road in question using Mecklenburg County traffic volumes
mecklenburg_volumes <- traffic_volumes %>%
filter(COUNTY == 'MECKLENBURG')
# For each unique route, average the volumes to produce a single mean count per route
unique_routes <- unique(mecklenburg_volumes$ROUTE)
mecklenburg_volumes["new.mean_vol"] <- NA
for(i in seq_along(unique_routes)) {
route_data <- mecklenburg_volumes[mecklenburg_volumes$ROUTE == unique_routes[i],"2016"]
mean_vol <- lapply(route_data, mean, na.rm = TRUE)
mecklenburg_volumes[mecklenburg_volumes$ROUTE == unique_routes[i],"new.mean_vol"] <- mean_vol
}
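The per-route loop above can also be expressed as a grouped mutate (a sketch assuming the same 2016 volume column, shown on a toy tibble):

```r
library(dplyr)

vols <- tibble(ROUTE = c("A", "A", "B"), `2016` = c(100, 300, 50))

# one mean volume per route, broadcast back to every row of that route
vols <- vols %>%
  group_by(ROUTE) %>%
  mutate(new.mean_vol = mean(`2016`, na.rm = TRUE)) %>%
  ungroup()

vols$new.mean_vol  # 200 200 50
```

On mecklenburg_volumes this replaces both the unique_routes vector and the assignment loop with a single pipeline.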
# Tag accidents with traffic volumes, replace corrupt with average
accidents["new.mean_vol"] <- NA
for(i in seq_len(nrow(accidents))) {
mean_vol <- mecklenburg_volumes[grepl(as.character(accidents[i,"new.road_short"]), mecklenburg_volumes$ROUTE),"new.mean_vol"]
accidents[i,"new.mean_vol"] <- as.double(mean_vol$new.mean_vol[1])
}
accidents <- accidents %>%
mutate(new.mean_vol = ifelse(is.na(new.mean_vol), mean(new.mean_vol, na.rm=TRUE), new.mean_vol))
accidents$new.mean_vol <- round(accidents$new.mean_vol, 0)
Final select and cleanup
# Replace NA numerics as 0 such as rainfall, snowfall, etc. for continuous variables
accidents$weatherInfo.rain.1h[is.na(accidents$weatherInfo.rain.1h)] <- 0
accidents$weatherInfo.snow.1h[is.na(accidents$weatherInfo.snow.1h)] <- 0
accidents$weatherInfo.visibility[is.na(accidents$weatherInfo.visibility)] <- 0
accidents$weatherInfo.main.sea_level[is.na(accidents$weatherInfo.main.sea_level)] <- 0
accidents$weatherInfo.main.grnd_level[is.na(accidents$weatherInfo.main.grnd_level)] <- 0
accidents$weatherInfo.wind.gust[is.na(accidents$weatherInfo.wind.gust)] <- 0
accidents$weatherInfo.rain.3h[is.na(accidents$weatherInfo.rain.3h)] <- 0
# Convert categoricals as factors
accidents$division <- as.factor(accidents$division)
accidents$weatherInfo.name <- as.factor(accidents$weatherInfo.name)
accidents <- accidents %>%
rename(
id = `_id.$oid`,
new.area = `weatherInfo.name`
) %>% select(
id,
event_no,
datetime_add,
division,
address,
event_type,
latitude,
longitude,
new.road,
new.speed_limit,
new.signal_proximity,
new.mean_vol,
new.month,
new.day,
new.hour,
new.minute,
new.area,
weatherInfo.weather,
weatherInfo.cod,
weatherInfo.visibility,
weatherInfo.main.temp,
weatherInfo.main.pressure,
weatherInfo.main.humidity,
weatherInfo.main.sea_level,
weatherInfo.main.grnd_level,
weatherInfo.wind.speed,
weatherInfo.wind.gust,
weatherInfo.clouds.all,
weatherInfo.sys.sunrise,
weatherInfo.sys.sunset,
weatherInfo.rain.3h,
weatherInfo.rain.1h,
weatherInfo.snow.1h,
med_household_income,
fam_poverty_rate,
population,
pop_sq_mile,
num_adolescents,
num_young_adults,
num_adults,
num_seniors,
median_age
)
Initial summaries of the accidents, with accident type as a visual to display the intensity of problem areas
dist_accidents <- accidents %>%
group_by(division, event_type) %>%
summarise(n = n()) %>%
arrange(n)
road_accidents <- accidents %>%
group_by(new.road) %>%
summarise(n = n()) %>%
filter(!is.na(new.road)) %>%
top_n(5) %>%
arrange(n)
hour_accidents <- accidents %>%
group_by(new.hour) %>%
summarise(n = n()) %>%
arrange(n)
hour_accidents_types <- accidents %>%
group_by(new.hour, event_type, division) %>%
summarise(n = n()) %>%
arrange(n)
age_accidents <- accidents %>%
group_by(median_age, event_type) %>%
summarise(n = n()) %>%
filter(median_age > 15 & median_age < 50) %>%
arrange(n)
A few initial graphs detailing traffic patterns in the area and relationships between various features
ggplot(dist_accidents, aes(x = reorder(division,n), y = n, fill = event_type)) +
geom_col() +
labs(title="Counts of accidents by division") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(road_accidents, aes(x = reorder(new.road, n), y = n)) +
geom_col() +
labs(title="Counts of accidents by road/intersection") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(accidents, aes(x=median_age, fill = event_type)) +
geom_histogram(bins = 20)
ggplot(accidents, aes(x = weatherInfo.rain.1h, y = weatherInfo.visibility, color = event_type)) +
geom_point() +
labs(title="Relationship between rainfall, visibility, and accidents")
ggplot(accidents, aes(sample = weatherInfo.wind.speed)) +
stat_qq() +
facet_wrap( ~ event_type, nrow = 1) +
labs(title="Distribution of Accident Types and Wind Speed")
ggplot(hour_accidents, aes(x = new.hour, y = n)) +
geom_line(color="red") +
scale_x_continuous(breaks = pretty(accidents$new.hour, n = 10)) +
geom_point() +
labs(title="Hourly traffic patterns grouped by counts for each day")
ggplot(accidents, aes(x = new.hour, fill = event_type)) +
geom_histogram(bins = 10) +
labs(title="Hourly Traffic patterns grouped by counts for event type, division") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_continuous(breaks = pretty(accidents$new.hour, n = 10))
ggplot(accidents, aes(x=med_household_income, fill = event_type)) +
geom_histogram(bins = 10)
ggplot(accidents, aes(x=population)) +
geom_histogram(bins = 10)
ggplot(accidents, aes(x=new.signal_proximity)) +
geom_histogram(bins = 10)
ggplot(accidents, aes(x=new.mean_vol)) +
geom_histogram(bins = 10)
A few points from the graphics above:
The top divisions for accident counts are University City, Steele Creek, Central, and South, by a substantial margin
The top road for accident counts is N Tryon St, by far
Census age information does not necessarily represent an accurate age count per accident, though the median age for most accidents falls in the 30-40 range by far
The expected relationship between rainfall and visibility is apparent in the accident plots: even a slight increase in rainfall results in many more accidents under low visibility
There is no apparent relationship between wind speed and accident type, though one hypothesis is that increased wind would cause more TF (traffic function) type events
Based on the time series, the highest number of accidents occurs at 6:00 PM EST on the dot
The time series also shows slight troughs between lunchtime and 5:00 PM, which might reflect people getting off work at certain shifts (12, 3, 4)
Census income shows a slight increase in hit-and-run counts in lower income bins; a hypothesis is that these individuals are already struggling financially, so the chance of escaping a potential cost could outweigh being stuck with more expenses
The vast majority of accidents occur within 500-1000 meters of a traffic signal, but this makes sense given that Charlotte-Mecklenburg is an urban area with many signals even close to interstate highway ramps
Correlation matrix for numeric features only. # TODO: transform categorical features to correct numeric representations
accidents_matrix <- accidents %>%
select_if(is.numeric) %>%
cor(.)
melted_acc <- melt(accidents_matrix)
ggplot(data = melted_acc, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
For creating training data, we will pick a random date/time during normal hours; if no accident occurred then, we place the record in our non-accident bin. We should also keep the proportions realistic: there should be fewer accidents than non-accidents so the training set preserves some realism for our predictions.
Generate non-accident training data and test data
Select a random accident record and alter the road, hour of the day, and day of the year; if no matching accident record exists, add the altered record to our training set of non-accidents
START_DATE <- min(accidents$datetime_add)
END_DATE <- max(accidents$datetime_add)
TRAIN_SIZE <- 10000
ITERATIONS <- 2
# generate random non-accidents by altering hour, day, month, and road, if no accident add to our train
non_accidents <- NULL
for(iteration in seq_len(ITERATIONS)) {
for(i in seq_len(nrow(accidents))) {
recc <- accidents[i,]
test_date <- as.POSIXct(sample(as.numeric(START_DATE): as.numeric(END_DATE), 1, replace = T), origin = '1970-01-01')
t_month <- month(test_date)
t_day <- day(test_date)
t_hour <- hour(test_date)
t_minute <- format(test_date, "%H:%M")
t_road <- sample(accidents$new.road, 1)
acc_rec <- accidents[(accidents$new.hour == t_hour) &
(accidents$new.day == t_day) &
(accidents$new.month == t_month) &
(accidents$new.road == t_road), ]
if(nrow(acc_rec) != 0) {
print("accident found")
next
} else {
# Add to train set from modified record
recc$new.hour <- t_hour
recc$new.day <- t_day
recc$new.month <- t_month
recc$new.road <- t_road
recc$new.minute <- t_minute
recc$datetime_add <- test_date
non_accidents <- rbind(recc, non_accidents)
}
}
}
Sanity check to ensure non_accidents indeed did not occur, then combine non_accidents with accidents to produce our training set
train_set_check <- accidents[accidents$datetime_add %in% non_accidents$datetime_add,] # %in% avoids elementwise recycling between vectors of different lengths
nrow(train_set_check)
## [1] 0
accidents["is_accident"] <- 1
non_accidents["is_accident"] <- 0
train <- rbind(accidents, non_accidents)
rf.train.1 <- train[, c(
"new.hour",
"new.day",
"new.month",
"division",
"new.signal_proximity",
"new.area",
"new.speed_limit",
"new.mean_vol",
"weatherInfo.visibility",
"weatherInfo.main.temp",
"weatherInfo.wind.speed",
"weatherInfo.clouds.all",
"weatherInfo.rain.1h",
"weatherInfo.snow.1h",
"pop_sq_mile",
"median_age"
)
]
rf.label <- as.factor(train$is_accident)
set.seed(1234)
rf.1 <- randomForest(x = rf.train.1, y = rf.label, importance = TRUE, ntree = 1000)
varImpPlot(rf.1)
Based on observations of an initial trained model, hour/day is the single biggest predictor among all features
As we saw previously, larger concentrations of accidents do occur with bad weather (temp, clouds, rain)
An odd observation: the initial model shows little importance for speed limit or signal proximity compared to other features. This is most likely because many roads in our dataset share similar speed limits, and most accidents occur within range of a traffic signal anyway. We might have to do some scaling with these features.
Based on the data gathered so far, my suspicion is that this model is highly overfit to the hour/day features, and we should probably generate and/or gather better non-accident data to supplement our training set.
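To test that suspicion, a k-fold cross-validation sketch would help (shown on the built-in iris data so it is self-contained; substitute rf.train.1 and rf.label to run it on the actual training set):

```r
library(randomForest)

set.seed(1234)
x <- iris[, 1:4]
y <- iris$Species
k <- 5
folds <- sample(rep(1:k, length.out = nrow(x)))

# hold out one fold at a time, train on the rest, score on the held-out fold
fold_acc <- sapply(1:k, function(f) {
  fit <- randomForest(x = x[folds != f, ], y = y[folds != f], ntree = 200)
  mean(predict(fit, x[folds == f, ]) == y[folds == f])
})
mean(fold_acc)
```

A large gap between out-of-bag accuracy and the cross-validated accuracy on re-generated non-accident samples would support the overfitting concern.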
Some items to update:
Bin roads into proper factors (right now we are not taking the specific road involved in the accident into account; this will be a huge feature for us, but we currently cannot use our 500+ unique shortened road names)
Add better supplemental training data for non-accidents: vary more features and build a bigger training set
Combine/understand features, or drop unneeded ones